Libraries
We load the packages relevant for the exercise.
library(FactoMineR)
library(tidyr)
library(dplyr)
library(tidyverse)
library(magrittr)
library(ggplot2)
library(ggpubr)
library(factoextra)
library(gridExtra)
library(moments)
Screw Caps Data
The file ScrewCaps.csv contains 195 lots of screw caps described by 11 variables. Diameter, weight and Length are the physical characteristics of the cap; nb.of.pieces is the number of elements of the cap (the picture above corresponds to a cap with 2 pieces: the valve (clapet) is made of a different material); Mature.Volume corresponds to the number of caps ordered and bought by the company (number in the lot).
raw_data <- read.table("ScrewCaps.csv",header=TRUE, sep=",", dec=".", row.names=1)
head(raw_data)
dim(raw_data)
[1] 195 11
summary(raw_data)
Supplier Diameter weight nb.of.pieces Shape Impermeability Finishing Mature.Volume Raw.Material Price
Supplier A: 31 Min. :0.4458 Min. :0.610 Min. : 2.000 Shape 1:134 Type 1:172 Hot Printing: 62 Min. : 1000 ABS: 21 Min. : 6.477
Supplier B:150 1st Qu.:0.7785 1st Qu.:1.083 1st Qu.: 3.000 Shape 2: 45 Type 2: 23 Lacquering :133 1st Qu.: 15000 PP :148 1st Qu.:11.807
Supplier C: 14 Median :1.0120 Median :1.400 Median : 4.000 Shape 3: 8 Median : 45000 PS : 26 Median :14.384
Mean :1.2843 Mean :1.701 Mean : 4.113 Shape 4: 8 Mean : 96930 Mean :16.444
3rd Qu.:1.2886 3rd Qu.:1.704 3rd Qu.: 5.000 3rd Qu.:115000 3rd Qu.:18.902
Max. :5.3950 Max. :7.112 Max. :10.000 Max. :800000 Max. :46.610
Length
Min. : 3.369
1st Qu.: 6.161
Median : 8.086
Mean :10.247
3rd Qu.:10.340
Max. :43.359
2) We start with univariate and bivariate descriptive statistics. Using appropriate plot(s) or summaries answer the following questions.
a) How is the distribution of the Price? Comment your plot with respect to the quartiles of the Price.
From the quantile output, the median, first quartile and third quartile of Price are 14.384, 11.807 and 18.902 respectively.
The plots, together with the skewness and kurtosis statistics, suggest the Price follows a right-skewed, bimodal distribution.
The major mode is around 14 and the antimode is around 29. Furthermore, 50% of the prices lie between 11.807 and 18.902. This is consistent with the graph, where the majority of the density is concentrated inside this range, with a long right tail of prices outside it.
The boxplot supports this analysis and suggests the values in the tail are outliers.
price_density <- ggdensity(raw_data,x="Price",y = "..count..",
color="darkblue",
fill="lightblue",size=0.5,
alpha=0.2,
title = "Screw Cap Price Distribution",
linetype = "solid", add = c("median"))+ font("title", size = 12,face="bold")
price_boxplot <- ggboxplot(raw_data$Price, width = 0.1, fill ="lightgray", outlier.colour = "darkblue", outlier.shape=4.2, ylab = "Price", xlab = "Screw Caps" , title = "Price Box Plot") + rotate() + font("title", size = 12,face="bold")
price_quantile <- quantile(raw_data$Price)
ggarrange(price_density, price_boxplot, ncol = 1, nrow = 2)
price_quantile
0% 25% 50% 75% 100%
6.477451 11.807022 14.384413 18.902429 46.610372
skewness(raw_data$Price)
[1] 1.706151
kurtosis(raw_data$Price)
[1] 6.395453
b) Does the Price depend on the Length? weight?
We examine Price vs. Length and log(Price) vs. log(Length), then Price vs. weight and log(Price) vs. log(weight), and provide the regression summary for each.
The scatter plots alone suggest only a rough relationship between the variables, but the R-squared values and the results of the F and t tests confirm it to a high degree of significance.
price_length <- ggplot(raw_data, aes(x=Length, y=Price)) + geom_point() + geom_smooth(method=lm, color="darkgreen")+ theme_minimal()
price_length_log <- ggplot(raw_data, aes(x=log(Length), y=log(Price))) + geom_point() + geom_smooth(method=lm, color="darkgreen")+ theme_minimal()
price_weight <- ggplot(raw_data, aes(x=weight, y=Price)) + geom_point() + geom_smooth(method=lm,color="red")+theme_minimal()
price_weight_log <- ggplot(raw_data, aes(x=log(weight), y=log(Price))) + geom_point() + geom_smooth(method=lm,color="red")+theme_minimal()
ggarrange(ggarrange(price_length, price_length_log, ncol = 2, nrow = 1), ggarrange(price_weight, price_weight_log, ncol = 2, nrow = 1), ncol = 1, nrow = 2)
summary(lm(formula = Price ~ Length, raw_data))
Call:
lm(formula = Price ~ Length, data = raw_data)
Residuals:
Min 1Q Median 3Q Max
-13.901 -2.854 -0.741 1.931 16.181
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.94613 0.50918 17.57 <2e-16 ***
Length 0.73168 0.03953 18.51 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.308 on 193 degrees of freedom
Multiple R-squared: 0.6397, Adjusted R-squared: 0.6378
F-statistic: 342.6 on 1 and 193 DF, p-value: < 2.2e-16
summary(lm(formula = log(Price) ~ log(Length), raw_data))
Call:
lm(formula = log(Price) ~ log(Length), data = raw_data)
Residuals:
Min 1Q Median 3Q Max
-0.70368 -0.15501 -0.01661 0.15170 0.59211
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.56380 0.07278 21.49 <2e-16 ***
log(Length) 0.53875 0.03282 16.42 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2466 on 193 degrees of freedom
Multiple R-squared: 0.5827, Adjusted R-squared: 0.5805
F-statistic: 269.5 on 1 and 193 DF, p-value: < 2.2e-16
summary(lm(formula = Price ~ weight, raw_data))
Call:
lm(formula = Price ~ weight, data = raw_data)
Residuals:
Min 1Q Median 3Q Max
-14.7993 -2.6207 -0.6631 2.5396 13.8357
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.2275 0.5602 14.69 <2e-16 ***
weight 4.8312 0.2718 17.78 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.419 on 193 degrees of freedom
Multiple R-squared: 0.6208, Adjusted R-squared: 0.6189
F-statistic: 316 on 1 and 193 DF, p-value: < 2.2e-16
summary(lm(formula = log(Price) ~ log(weight), raw_data))
Call:
lm(formula = log(Price) ~ log(weight), data = raw_data)
Residuals:
Min 1Q Median 3Q Max
-0.71123 -0.15340 -0.01343 0.17735 0.69552
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.50618 0.02333 107.42 <2e-16 ***
log(weight) 0.56453 0.03718 15.18 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2577 on 193 degrees of freedom
Multiple R-squared: 0.5443, Adjusted R-squared: 0.5419
F-statistic: 230.5 on 1 and 193 DF, p-value: < 2.2e-16
c) Does the Price depend on the Impermeability? Shape?
Concerning Impermeability, the plots below show striking differences between the price distributions for Type 1 and Type 2, in particular in their medians and IQRs.
Concerning Shape, it is difficult to draw any real conclusions about Shape 3 and Shape 4 given how few data points there are. We therefore focus on Shape 1 and Shape 2, whose IQRs look clearly different; this is confirmed by the t test on the Shape 2 coefficient in the regression below.
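The t test mentioned above can be made explicit; a minimal sketch restricted to Shape 1 and Shape 2 (a Welch two-sample test, rather than the regression t test reported below):

```r
# Keep only the two well-populated shapes, then compare their mean prices
shape12 <- subset(raw_data, Shape %in% c("Shape 1", "Shape 2"))
shape12$Shape <- droplevels(shape12$Shape)  # t.test needs exactly two factor levels
t.test(Price ~ Shape, data = shape12)
```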
impermability_plot_1 <- ggdotplot(raw_data,x="Impermeability",y="Price",color = "Impermeability", palette = "jco",binwidth = 1,legend="none")
shape_plot_1 <- ggdotplot(raw_data,x="Shape",y="Price",color = "Shape", palette = "npg",binwidth = 1,legend="none")
impermability_plot_2 <- ggboxplot(raw_data,x="Impermeability",y="Price",color = "Impermeability", palette = "jco",legend="none")
shape_plot_2 <- ggboxplot(raw_data,x="Shape",y="Price",color = "Shape", palette = "npg", legend = "none")
ggarrange(ggarrange(impermability_plot_1,impermability_plot_2,ncol = 2, nrow = 1),
ggarrange(shape_plot_1,shape_plot_2,ncol = 2, nrow = 1),
ncol = 1, nrow = 2)
summary(lm(Price~ Impermeability, data=raw_data))
Call:
lm(formula = Price ~ Impermeability, data = raw_data)
Residuals:
Min 1Q Median 3Q Max
-16.4106 -3.0187 -0.6286 2.4897 25.0638
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.7236 0.4117 35.77 <2e-16 ***
ImpermeabilityType 2 14.5846 1.1986 12.17 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.399 on 193 degrees of freedom
Multiple R-squared: 0.4341, Adjusted R-squared: 0.4312
F-statistic: 148 on 1 and 193 DF, p-value: < 2.2e-16
summary(lm(Price~ Shape, data=raw_data))
Call:
lm(formula = Price ~ Shape, data = raw_data)
Residuals:
Min 1Q Median 3Q Max
-11.098 -3.850 -1.025 3.055 25.587
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.2006 0.5406 26.267 < 2e-16 ***
ShapeShape 2 8.1403 1.0782 7.550 1.75e-12 ***
ShapeShape 3 1.4510 2.2777 0.637 0.52485
ShapeShape 4 7.4393 2.2777 3.266 0.00129 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.258 on 191 degrees of freedom
Multiple R-squared: 0.2475, Adjusted R-squared: 0.2357
F-statistic: 20.94 on 3 and 191 DF, p-value: 9.008e-12
d) Which is the less expensive Supplier?
The answer to this question depends on the definition of expensive.
First, examine the following absolute metrics (visible in the boxplot):
1) Minimum price - Supplier B is cheapest (6.477451). However, Supplier B is also the supplier with the highest price (46.610372).
2) Average Price - Supplier C is cheapest (14.88869).
Second, examine the following relative metrics:
3) Average Price / unit Length - Supplier A (1.505043)
4) Average Price / unit weight - Supplier A (9.013902)
5) Average Price / unit Diameter - Supplier A (11.95632)
6) Average Price / unit Mature.Volume - Supplier B (1.663305)
The results above suggest Supplier A has the cheapest average price per unit of production.
The analysis is nevertheless incomplete, since we do not have a single definition of cheapest: the scatter and box plots below suggest the suppliers may cater to specific product ranges, we have little data for Supplier C, and we have not performed statistical tests to assess the significance of these differences.
Furthermore, the analysis ignores the categorical variables, which could provide insights into the cheapest price for certain product features.
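The missing significance check can be sketched as follows; the Kruskal-Wallis test is our choice here (not part of the original analysis) because the Price distribution is right-skewed and the supplier groups are unbalanced:

```r
# Non-parametric test of a Supplier effect on Price
kruskal.test(Price ~ Supplier, data = raw_data)
# Parametric one-way ANOVA alternative, for comparison
summary(aov(Price ~ Supplier, data = raw_data))
```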
supplier_plot_1 <- ggboxplot(raw_data,x="Supplier",y="Price",color = "Supplier", palette = c("darkblue","red","darkgreen"),legend="none") + rotate()
supplier_plot_2 <- ggscatter(raw_data,x="Length",y="Price",color = "Supplier", palette = c("darkblue","red","darkgreen"),xscale= "log10", yscale="log10")
supplier_plot_3 <- ggscatter(raw_data,x="weight",y="Price",color = "Supplier", palette = c("darkblue","red","darkgreen"),xscale= "log10", yscale="log10")
supplier_plot_4 <- ggscatter(raw_data,x="Diameter",y="Price",color = "Supplier", palette = c("darkblue","red","darkgreen"),xscale= "log10", yscale="log10")
supplier_statistics <- raw_data %>% group_by(Supplier) %>% summarise( "Average Price" = mean(Price), "Average Length" = mean(Length),"Average weight" = mean(weight),"Average Diameter" = mean(Diameter),"Average Mature.Volume" = mean(Mature.Volume) ,"Average Price / Length" = mean(Price)/mean(Length), "Average Price / weight" = mean(Price)/mean(weight), "Average Price / Diameter" = mean(Price)/mean(Diameter),"Average Price / Mature.Volume" = mean(Price)/mean(Mature.Volume))
supplier_plot_1
supplier_plot_2
supplier_plot_3
supplier_plot_4
head(supplier_statistics)
3) One important point in exploratory data analysis consists in identifying potential outliers. Could you give points which are suspect regarding the Mature.Volume variable? Give the characteristics (other features) of the observations that seem suspect.
There are four data points which seem suspect: they share the same values for Diameter, weight, nb.of.pieces, Impermeability, Finishing, Raw.Material and Mature.Volume, and differ only in Supplier, Price and Length. This suggests some error in collating the data (a system error or default values).
Mature.Volume_plot <- gghistogram(raw_data,x="Mature.Volume",y="..count..", color = "darkblue", fill = "lightgrey") + theme_minimal()
Using `bins = 30` by default. Pick better value with the argument `bins`.
Mature.Volume_plot
raw_data %>% filter (Mature.Volume > 6e+05 )
For the rest of the analysis, the 4 data points above are disregarded.
library(dplyr)
raw_data <- raw_data %>% filter (Mature.Volume < 6e+05 )
We will quickly check that there are no other noticeable outliers - this is indeed the case.
check_1 <- gghistogram(raw_data,x="Length",y="..count..", color = "darkblue", fill = "lightgrey") + theme_minimal()
check_2 <- gghistogram(raw_data,x="Diameter",y="..count..", color = "darkblue", fill = "lightgrey") + theme_minimal()
check_3 <- gghistogram(raw_data,x="weight",y="..count..", color = "darkblue", fill = "lightgrey") + theme_minimal()
check_4 <- gghistogram(raw_data,x="nb.of.pieces",y="..count..", color = "darkblue", fill = "lightgrey",bins=40) + theme_minimal()
ggarrange(ggarrange(check_1,check_2,ncol=2,nrow=1),ggarrange(check_3,check_4,ncol=2,nrow=1),ncol=1,nrow=2)
4) Perform a PCA on the dataset ScrewCap, explain briefly what are the aims of a PCA and how categorical variables are handled?
Principal component analysis (PCA) is a technique for taking high-dimensional data and using the dependencies between the variables to represent it in a more tractable, lower-dimensional form without losing too much information - we try to capture the essence of high-dimensional data in a low-dimensional representation. The aim of PCA is to draw conclusions from the linear relationships between variables by detecting the principal dimensions of variability. This may be for compression, denoising, data completion, anomaly detection, or for preprocessing before supervised learning (to improve performance, or as regularization to reduce overfitting).
The categorical variables cannot be represented in the same way as the supplementary quantitative variables, since it is not possible to calculate the correlation between a categorical variable and the principal components. The categorical variables here are handled as supplementary variables on a purely illustrative basis - they are not used to calculate the distances between individuals. We represent each category at the barycentre of all the individuals possessing it. A category on the PCA plots below can therefore be regarded as the mean individual obtained from the set of individuals who have it.
Given that our ultimate goal here is to explore the data prior to a multiple regression, it is advisable to choose the explanatory variables of the regression model as active variables for the PCA, and to project the variable to be explained (the dependent variable) as a supplementary variable. This gives some idea of the relationships between explanatory variables, and thus of the need to select among them. It also gives an idea of the quality of the regression: if the dependent variable is well projected, the model is likely to fit well. Thus we select Price as a supplementary variable.
The dataset in this exercise contains 6 supplementary variables: 1 quantitative variable (Price) and 5 qualitative variables (Supplier, Shape, Impermeability, Finishing and Raw.Material).
res.pca <- PCA(raw_data,quali.sup = c(1,5,6,7,9),quanti.sup = 10, graph = FALSE,scale = TRUE)
fviz_pca_ind(res.pca, col.ind="cos2", label=c("quali"), geom = "point", title = "Individual factor map (PCA)", habillage = "none") + scale_color_gradient2(low="lightblue", mid="blue", high="darkblue", midpoint=0.6) + theme_minimal()
plot.PCA(res.pca,choix = c("ind"),invisible = c("ind"))+theme_minimal()
plot.PCA(res.pca,choix = c("var"))+theme_minimal()
5) Compute the correlation matrix between the variables and comment it with respect to the correlation circle
The first task is to center and standardize the variables; the correlation matrix is then computed. All variable vectors are quite near the boundary of the correlation circle on the variables plot, so the variables are relatively well projected on the 2-dimensional subspace. We now turn our attention to the correlations between variables, which can be visualised through the angles between variable vectors on the correlation circle (small angles suggest strong positive correlation, 90-degree angles suggest no correlation, 180-degree angles suggest strong negative correlation).
- Diameter, Length and weight show very strong correlation: the angles between them are close to 0, suggesting correlations close to 1. All three are very highly correlated with the first principal component.
- These three variables are at angles slightly wider than a right angle to both nb.of.pieces and Mature.Volume on the circle, which suggests slightly negative correlations.
- Price is highly correlated with the three variables above.
- Equally, Mature.Volume and nb.of.pieces are at a slightly wider angle than a right angle to each other, suggesting a slightly negative correlation: when the screw caps have a high number of pieces, the company orders a smaller volume of them. These two variables are well projected on the second principal component.
don <- as.matrix(raw_data[,-c(1,5,6,7,9,10)]) %>% scale()
don_correlation <- cor(don)
don_correlation
Diameter weight nb.of.pieces Mature.Volume Length
Diameter 1.0000000 0.9622544 -0.14869500 -0.29164724 0.9996963
weight 0.9622544 1.0000000 -0.16884367 -0.31321323 0.9627460
nb.of.pieces -0.1486950 -0.1688437 1.00000000 -0.07462463 -0.1463770
Mature.Volume -0.2916472 -0.3132132 -0.07462463 1.00000000 -0.2936330
Length 0.9996963 0.9627460 -0.14637705 -0.29363295 1.0000000
plot.PCA(res.pca,choix = c("var"))+theme_minimal()
6) On what kind of relationship PCA focuses? Is it a problem?
PCA focuses on the linear relationships between continuous variables. Given that more complex links also exist, such as quadratic, logarithmic or exponential relationships, this may seem restrictive, but in practice many relationships can be considered linear, at least as a first approximation. However, there are obviously non-linear datasets on which PCA has pitfalls (e.g. a spiral dataset). Furthermore, in PCA categorical variables cannot be active variables, which can also be restrictive.
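As an illustration of the pitfall, a small sketch on synthetic data (not the exercise data): PCA on a noisy 2-D spiral, where no single linear axis captures the structure.

```r
# Generate a noisy spiral and run PCA on it
set.seed(1)
t <- seq(0, 4 * pi, length.out = 200)
spiral <- cbind(x = t * cos(t), y = t * sin(t)) + rnorm(400, sd = 0.1)
res_spiral <- prcomp(spiral, scale. = TRUE)
summary(res_spiral)  # the variance splits across both components: no linear axis dominates
```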
7) Comment the PCA outputs.
Comment the position of the categories Impermeability=type 2 and Raw.Material=PS.
The coordinates of Type 2 on the first two principal components are (3.29, 0.01). Its p-value for Dim 1 is very significant, so its coordinate is significantly different from 0 on the first component. The coordinates of PS are (2.67, -0.25); its p-value for Dim 1 is also very significant, so its coordinate is likewise significantly different from 0 on the first component.
Furthermore, given that the correlation circle shows high correlation between the first component and Price, Diameter, Length and weight, this suggests Type 2 and PS have high values for these variables.
In fact, looking at the p-values, all four categories Type 1, Type 2, PS and PP have coordinates significantly different from 0 on the first component. As the estimate is positive for Type 2 and PS, and negative for Type 1 and PP, the rows which include Type 2 / PS tend to have positive coordinates on component 1 (and negative for Type 1 / PP), and are thus more associated with the variables mentioned above (Price, Diameter, Length and weight).
Finally, we also consider the results of the Wilks test performed in the FactoInvestigate package. This indicates which qualitative variables best separate the individuals on the plane (i.e. which ones best explain the distances between individuals); the best qualitative variables to illustrate distances between individuals on the plane are Impermeability and Raw.Material.
res.pca$quali.sup$coord
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
Supplier A 0.54805992 -0.054566515 -0.214051234 0.0227636641 0.0058684306
Supplier B -0.06543165 -0.125589918 -0.026949041 0.0018781980 -0.0006195266
Supplier C -0.44356100 1.440695488 0.728281700 -0.0670085402 -0.0056067533
Shape 1 -0.42564773 -0.137916238 -0.214123559 0.0065253174 -0.0009308749
Shape 2 1.42726960 0.394279456 0.383010989 -0.0314353290 0.0018793388
Shape 3 -0.55969671 -0.332207048 0.059604995 0.1360698967 -0.0029996610
Shape 4 -0.55191919 0.355523978 1.265466019 -0.0652825793 0.0075550964
Type 1 -0.45031131 -0.001823621 -0.009194259 0.0008200692 -0.0005392760
Type 2 3.28923043 0.013320364 0.067158065 -0.0059900708 0.0039390597
Hot Printing -0.28600503 -0.037712714 0.192161713 0.0717729126 -0.0006010224
Lacquering 0.13745978 0.018125491 -0.092356792 -0.0344955084 0.0002888635
ABS 0.87599666 0.220028373 -0.581512708 0.0043591307 -0.0032513149
PP -0.61062316 0.013651457 0.120658551 0.0042947466 -0.0005390562
PS 2.67437709 -0.253323291 -0.198579404 -0.0273071254 0.0056116038
dimdesc(res.pca, 1:2)
$Dim.1
$Dim.1$quanti
correlation p.value
Length 0.9853764 3.259183e-147
Diameter 0.9851090 1.784008e-146
weight 0.9774643 1.263294e-129
Price 0.7960132 4.472456e-43
nb.of.pieces -0.2017085 5.139018e-03
Mature.Volume -0.4118157 3.243173e-09
$Dim.1$quali
R2 p.value
Impermeability 0.4767041 2.203784e-28
Raw.Material 0.4309747 9.602186e-24
Shape 0.2024825 3.268025e-09
$Dim.1$category
Estimate p.value
Type 2 1.8697709 2.203784e-28
PS 1.6944602 3.078822e-20
Shape 2 1.4547681 6.874053e-11
ABS -0.1039202 1.566216e-02
Shape 1 -0.3981492 5.692581e-07
PP -1.5905400 1.465743e-20
Type 1 -1.8697709 2.203784e-28
$Dim.2
$Dim.2$quanti
correlation p.value
nb.of.pieces 0.8426737 1.045066e-52
Price 0.1706662 1.824955e-02
Mature.Volume -0.5956751 9.962493e-20
$Dim.2$quali
R2 p.value
Supplier 0.15447684 1.411880e-07
Shape 0.05575798 1.311396e-02
$Dim.2$category
Estimate p.value
Supplier C 1.0205158 1.996715e-08
Shape 2 0.3243594 3.249853e-03
Shape 1 -0.2078363 6.889304e-03
Supplier B -0.5457696 1.703585e-03
wilks.p <-structure(c(3.63471614445021e-27, 1.29050347661083e-22, 4.08870696535012e-10,
3.73163828614179e-07, 0.284227353624445), .Names = c("Impermeability",
"Raw.Material", "Shape", "Supplier", "Finishing"))
wilks.p
Impermeability Raw.Material Shape Supplier Finishing
3.634716e-27 1.290503e-22 4.088707e-10 3.731638e-07 2.842274e-01
Comment the percentage of inertia
Below, in the scree plot, we see the percentage of inertia explained by each component. The first two components explain 83.48% of the total inertia of the dataset - this means that 83.48% of the total variability of the cloud of individuals (or variables) is explained by this plane. Over 95% of the variance can be explained with the first 3 synthetic variables of the PCA.
We can also see that the variance of the first component is mostly explained by Diameter, Length and weight, as expected; the second and third dimensions are dominated by nb.of.pieces and Mature.Volume respectively, each contributing over 60%.
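No scree-plot code appears above; a minimal way to draw it, assuming the factoextra package already loaded in this document:

```r
# Scree plot: percentage of inertia carried by each principal component
fviz_eig(res.pca, addlabels = TRUE)
# The underlying eigenvalue table, including cumulative percentages
res.pca$eig
```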
res.pca$var$contrib
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
Diameter 31.232761 0.06080408 2.673943 16.37238500 4.966011e+01
weight 30.749891 0.07707453 1.371044 67.79961131 2.379254e-03
nb.of.pieces 1.309454 66.55678435 32.075772 0.05771098 2.791480e-04
Mature.Volume 5.458176 33.25770535 61.213010 0.07097644 1.329131e-04
Length 31.249719 0.04763169 2.666232 15.69931627 5.033710e+01
8) Give the R object with the two principal components which are the synthetic variables the most correlated to all the variables.
These are found in the code below -
res.pca$var$coord[,1:2]
Dim.1 Dim.2
Diameter 0.9851090 -0.02547004
weight 0.9774643 -0.02867601
nb.of.pieces -0.2017085 0.84267375
Mature.Volume -0.4118157 -0.59567509
Length 0.9853764 -0.02254298
9) PCA is often used as a pre-processing step before applying a clustering algorithm, explain the rationale of this approach and how many components k you keep.
PCA as a pre-processing step acts as a denoising step: the leading components capture the stable structure of the data, while the trailing components can largely be regarded as noise, so clustering on the selected components tends to be more robust. We choose the number of components so as not to lose any significant information while discarding the components that can be considered noise. Consequently, we keep the number of dimensions such that 95% of the inertia is retained, which corresponds to 3 components in our analysis.
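A quick sketch to check the 3-component choice, assuming the standard column names of the res.pca$eig table in FactoMineR:

```r
# Cumulative percentage of inertia per number of components
cum_inertia <- res.pca$eig[, "cumulative percentage of variance"]
# Smallest number of components retaining at least 95% of the inertia
which(cum_inertia >= 95)[1]
```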
10) Perform a k-means algorithm on the selected k principal components of the PCA. How many clusters are you keeping? Justify.
We use the Elbow method and look at the knee to determine the number of clusters; we keep 3 clusters here.
Recall that the basic idea behind k-means clustering is to define clusters such that the total within-cluster variation is minimized. The total WSS measures the compactness of the clustering, and we want it to be as small as possible. The Elbow method looks at the total WSS as a function of the number of clusters: one should choose the number of clusters beyond which an additional cluster barely improves the WSS. Here this value is 3.
#Perform PCA on 3 Principal Components
res.pca_3 <- PCA(raw_data,quali.sup = c(1,5,6,7,9),quanti.sup = 10, ncp=3, graph = FALSE, scale = TRUE)
#Use the Elbow method to determine the number of clusters to keep
fviz_nbclust(res.pca_3$ind$coord, kmeans, method = "wss") + geom_vline(xintercept = 3, linetype = 2)
#Kmeans Algorithm (set.seed and nstart = 25 make the multi-start result reproducible)
set.seed(123)
k_cluster <- kmeans(res.pca_3$ind$coord, centers = 3, nstart = 25)
k_mean_data <- as.matrix(res.pca_3$ind$coord)
plot(k_mean_data, col = k_cluster$cluster, pch = 20, frame = FALSE, main = "K-means")
11) Perform an AHC on the selected k principal components of PCA.
Below we perform an AHC on the selected 3 principal components of the PCA.
res.pca_3 <- PCA(raw_data,quali.sup = c(1,5,6,7,9),quanti.sup = 10, ncp=3, graph = FALSE, scale = TRUE)
res.hcpc <- HCPC(res.pca_3, nb.clust = -1, graph = FALSE)
Chi-squared approximation may be incorrect (warning repeated for several variables)
plot.HCPC(res.hcpc, choice = 'map', draw.tree = FALSE, select = "cos2 5", title = 'Factor Map')
plot.HCPC(res.hcpc, choice = 'bar', draw.tree = FALSE, title = 'Inertia Gain')
plot.HCPC(res.hcpc, choice = '3D.map', draw.tree = FALSE, title = 'Hierarchical Clustering on the Factor Map', angle=60)
plot.HCPC(res.hcpc, choice = 'tree', draw.tree = FALSE, title = 'Hierarchical Clustering')
res.hcpc$call$t$within[1:5]
[1] 4.950897 2.441761 1.644233 1.264480 1.009847
1 - (res.hcpc$call$t$within[3]/res.hcpc$call$t$within[1])
[1] 0.6678919
12) Comment the results and describe precisely one cluster.
The classification made on the individuals reveals 3 clusters.
We now aim at describing the clusters.
For each quantitative variable, we build an analysis-of-variance model between the quantitative variable and the class variable, perform a Fisher test to detect a class effect, and sort the variables by increasing p-value. The results can be found directly in the following objects:
res.hcpc$desc.var$quanti.var
Eta2 P-value
Length 0.8036361 3.530117e-67
Diameter 0.8025769 5.853470e-67
weight 0.8013378 1.053993e-66
Mature.Volume 0.7588382 8.649758e-59
Price 0.4812030 1.620918e-27
nb.of.pieces 0.1760507 1.243497e-08
For high values of Eta2, as seen in the lecture, there is a relationship between the clustering and the quantitative variable. We can say this is the case for Length, Diameter, weight and Mature.Volume.
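As a sanity check, the Eta2 reported for Length can be recomputed by hand with a one-way ANOVA against the cluster label; this sketch assumes the rows of the filtered raw_data align with res.hcpc$data.clust:

```r
# One-way ANOVA of Length against the HCPC cluster label
clust <- res.hcpc$data.clust$clust
fit <- aov(raw_data$Length ~ clust)
ss <- summary(fit)[[1]][["Sum Sq"]]
# Eta2 = between-cluster sum of squares / total sum of squares
ss[1] / sum(ss)  # compare with the reported Eta2 of 0.8036
```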
We now turn our attention to describing the observations in a cluster using quantitative variables.
res.hcpc$desc.var$quanti
$`1`
v.test Mean in category Overall mean sd in category Overall sd p.value
Mature.Volume 11.942982 2.431183e+05 82206.026178 67166.762125 9.103190e+04 7.064414e-33
Diameter -3.255425 8.214269e-01 1.294639 0.254233 9.821218e-01 1.132228e-03
Length -3.297003 6.491733e+00 10.329589 2.056760 7.864783e+00 9.772253e-04
weight -3.536244 1.100262e+00 1.714121 0.315574 1.172854e+00 4.058595e-04
nb.of.pieces -3.780986 3.324324e+00 4.115183 1.274576 1.413225e+00 1.562083e-04
Price -3.857939 1.245686e+01 16.552332 4.115901 7.172431e+00 1.143473e-04
$`2`
v.test Mean in category Overall mean sd in category Overall sd p.value
nb.of.pieces 5.739229 4.503759 4.115183 1.352381e+00 1.413225e+00 9.510856e-09
Price -2.999298 15.521715 16.552332 4.620374e+00 7.172431e+00 2.706026e-03
weight -5.297059 1.416482 1.714121 3.882302e-01 1.172854e+00 1.176825e-07
Length -5.533133 8.244766 10.329589 2.492726e+00 7.864783e+00 3.145605e-08
Diameter -5.565997 1.032748 1.294639 3.121379e-01 9.821218e-01 2.606582e-08
Mature.Volume -8.031930 47177.255639 82206.026178 3.971314e+04 9.103190e+04 9.595124e-16
$`3`
v.test Mean in category Overall mean sd in category Overall sd p.value
Length 12.298789 30.295407 10.329589 7.979008e+00 7.864783e+00 9.194180e-35
Diameter 12.294570 3.787033 1.294639 1.000521e+00 9.821218e-01 9.687099e-35
weight 12.254018 4.680731 1.714121 1.164251e+00 1.172854e+00 1.598746e-34
Price 9.282815 30.295414 16.552332 8.814239e+00 7.172431e+00 1.650583e-20
Mature.Volume -3.281665 20542.857143 82206.026178 1.547128e+04 9.103190e+04 1.031962e-03
nb.of.pieces -3.659694 3.047619 4.115183 7.221786e-01 1.413225e+00 2.525166e-04
Below, we also illustrate the clusters by considering their paragons and most specific individuals.
res.hcpc$desc.ind
$para
Cluster: 1
94 95 142 74 144
0.3280604 0.3376965 0.4205536 0.4265863 0.5886435
---------------------------------------------------------------------------------------------------------------------------------
Cluster: 2
82 149 129 148 36
0.3560908 0.3691347 0.3744143 0.3849945 0.3990805
---------------------------------------------------------------------------------------------------------------------------------
Cluster: 3
2 3 157 1 160
0.4151628 0.4690793 0.5302929 0.9668707 1.0514049
$dist
Cluster: 1
77 104 105 102 146
4.439104 4.427938 3.643115 3.369384 3.000193
---------------------------------------------------------------------------------------------------------------------------------
Cluster: 2
183 45 178 191 188
4.955342 3.980645 3.691443 3.447450 3.447219
---------------------------------------------------------------------------------------------------------------------------------
Cluster: 3
85 84 159 156 158
7.892354 7.761370 7.666985 6.749803 6.732176
Finally, we also look at the categorical variables for clues about the clusters:
res.hcpc$desc.var$test.chi2
p.value df
Impermeability 5.318642e-18 2
Raw.Material 5.226547e-17 4
Shape 5.626207e-06 6
Supplier 4.102258e-02 4
The variables Impermeability and Raw.Material seem very significantly linked to the partition, with Shape and Supplier also significant to a certain extent. For each qualitative variable, we perform a chi-squared test between it and the class variable and then sort the variables by increasing p-value.
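The chi-squared test described above can be reproduced by hand for a single variable; a sketch for Impermeability:

```r
# Contingency table of Impermeability against the cluster label, then a chi-squared test
clust <- res.hcpc$data.clust$clust
chisq.test(table(raw_data$Impermeability, clust))
```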
res.hcpc$desc.var$category
$`1`
Cla/Mod Mod/Cla Global p.value v.test
Raw.Material=PP 25.000000 97.297297 75.392670 0.0001368886 3.813724
Impermeability=Type 1 22.023810 100.000000 87.958115 0.0049829303 2.808135
Supplier=Supplier C 0.000000 0.000000 7.329843 0.0434829294 -2.019041
Raw.Material=PS 3.846154 2.702703 13.612565 0.0222292828 -2.286427
Raw.Material=ABS 0.000000 0.000000 10.994764 0.0081544536 -2.645607
Impermeability=Type 2 0.000000 0.000000 12.041885 0.0049829303 -2.808135
Shape=Shape 2 2.222222 2.702703 23.560209 0.0002330416 -3.680210
$`2`
Cla/Mod Mod/Cla Global p.value v.test
Impermeability=Type 1 74.40476 93.984962 87.958115 0.0002868332 3.626910
Supplier=Supplier C 100.00000 10.526316 7.329843 0.0050545222 2.803538
Raw.Material=PP 74.30556 80.451128 75.392670 0.0173789787 2.378590
Raw.Material=PS 38.46154 7.518797 13.612565 0.0004703729 -3.497084
Impermeability=Type 2 34.78261 6.015038 12.041885 0.0002868332 -3.626910
$`3`
Cla/Mod Mod/Cla Global p.value v.test
Impermeability=Type 2 65.2173913 71.428571 12.04188 2.909966e-12 6.982005
Raw.Material=PS 57.6923077 71.428571 13.61257 4.169068e-11 6.597941
Shape=Shape 2 31.1111111 66.666667 23.56021 9.932485e-06 4.418638
Shape=Shape 1 5.3846154 33.333333 68.06283 6.715709e-04 -3.400930
Impermeability=Type 1 3.5714286 28.571429 87.95812 2.909966e-12 -6.982005
Raw.Material=PP 0.6944444 4.761905 75.39267 2.869539e-13 -7.300381
We can confirm that these qualitative variables also play a significant role within each cluster individually.
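To make the columns of these tables concrete, the proportions for one row can be recomputed directly. A sketch, assuming the objects above, for Raw.Material = PS within cluster 3 (compare with the corresponding row of $`3`):

```r
clust  <- res.hcpc$data.clust$clust
in.mod <- raw_data$Raw.Material == "PS"   # the category (Mod)
in.cla <- clust == 3                      # the cluster (Cla)
100 * sum(in.mod & in.cla) / sum(in.mod)  # Cla/Mod: % of PS caps that fall in cluster 3
100 * sum(in.mod & in.cla) / sum(in.cla)  # Mod/Cla: % of cluster 3 made of PS
100 * mean(in.mod)                        # Global: % of PS in the whole dataset
```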
Finally, we link all of this back to our PCA:
res.hcpc$desc.axes
$quanti.var
Eta2 P-value
Dim.1 0.8309664 2.688258e-73
Dim.2 0.5350712 5.424124e-32
Dim.3 0.2440799 3.773433e-12
$quanti
$quanti$`1`
v.test Mean in category Overall mean sd in category Overall sd p.value
Dim.3 6.487292 0.8462912 1.421520e-15 0.8757617 0.8814013 8.739298e-11
Dim.1 -4.527644 -1.1812290 -5.988883e-16 0.4424802 1.7627029 5.964496e-06
Dim.2 -9.411906 -1.4388767 1.737821e-15 0.7723735 1.0329119 4.872192e-21
$quanti$`2`
v.test Mean in category Overall mean sd in category Overall sd p.value
Dim.2 9.409551 0.4656328 1.737821e-15 0.7182786 1.0329119 4.982628e-21
Dim.1 -4.493869 -0.3794992 -5.988883e-16 0.5207778 1.7627029 6.994079e-06
Dim.3 -6.203223 -0.2619404 1.421520e-15 0.7584515 0.8814013 5.531849e-10
$quanti$`3`
v.test Mean in category Overall mean sd in category Overall sd p.value
Dim.1 12.32585 4.484708 -5.988883e-16 1.647504 1.762703 6.574308e-35
attr(,"class")
[1] "catdes" "list "
The output above suggests strong relationships between the clusters and the dimensions: cluster 1 has coordinates significantly below 0 on the first and second dimensions but significantly above 0 on the third; cluster 2 has coordinates significantly below 0 on the first and third dimensions but significantly above 0 on the second; cluster 3 has coordinates significantly above 0 on the first dimension.
This is extremely insightful given the analysis we performed previously. In plain English, the above analysis can be summarised as follows:
Cluster 1 consists of individuals (such as individual 77) sharing high values for Mature.Volume and low values for nb.of.pieces, Price, weight, Length and Diameter (variables sorted from the weakest).
Cluster 2 consists of individuals (such as individual 183) sharing high values for nb.of.pieces and low values for Mature.Volume, Diameter, Length, weight and Price (variables sorted from the weakest).
Cluster 3 consists of individuals (such as individual 84) sharing high values for Length, Diameter, weight and Price (variables sorted from the strongest) and low values for nb.of.pieces and Mature.Volume (variables sorted from the weakest).
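These signs can be checked directly by averaging the principal coordinates per cluster. A sketch, assuming the row order is preserved between `res.pca` and `res.hcpc` (the FactoMineR default):

```r
# Mean coordinate of each cluster on the first three dimensions:
# the sign of each mean should match the v.test signs in desc.axes.
coords <- as.data.frame(res.pca$ind$coord[, 1:3])
aggregate(coords, by = list(cluster = res.hcpc$data.clust$clust), FUN = mean)
```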
13) If someone asks you why you selected k components and not k + 1 or k − 1, what is your answer? (Could you suggest a strategy to assess the stability of the approach? Are there many differences between the clustering obtained on k components and on the initial data?)
In the Husson textbook, it is suggested to test the percentage of inertia expressed by a component and then the percentage of inertia expressed by the first plane. The method simulates 10,000 datasets of 191 individuals and 11 normally distributed independent variables, which allows a comparison that takes into account the number of active individuals and active variables. It then conducts a standardised PCA on each dataset and computes the percentage of inertia explained by one component and that explained by one plane. The 0.95 quantile of the 10,000 percentages of inertia of the first component (respectively, the first plane) is then recorded. Comparing the percentage of inertia of a component or plane with the associated reference value amounts to testing the null hypothesis H0: the percentage of inertia explained by the first component (respectively, the first plane) is not significantly greater than that obtained with (normally distributed) independent data.
The first two dimensions of the PCA express 83.48% of the total inertia of the dataset, far above the reference value of 48.22%. This suggests it is probably not useful to interpret the subsequent dimensions.
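The reference value can be approximated with a quick simulation along the lines described above. A sketch (here only 1,000 simulated datasets rather than 10,000, so the quantile is approximate):

```r
library(FactoMineR)

set.seed(1)
ref <- replicate(1000, {
  # independent standard-normal data with the same dimensions as the active table
  sim <- as.data.frame(matrix(rnorm(191 * 11), nrow = 191))
  eig <- PCA(sim, graph = FALSE)$eig
  sum(eig[1:2, 2])               # % of inertia carried by the first plane
})
quantile(ref, 0.95)              # reference value, to compare with 83.48%
```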
However, this method might discard significant information. Consequently, we will instead use the methodology described previously.
A strategy to assess the stability of the approach is to repeat the same study with HCPC on 2 and 4 components, and then compare the p-values obtained for each configuration to see whether going from 2 to 3, or from 3 to 4, components has a significant impact on the clustering results. Doing so, we observe the same p-values for \(k=3\) and \(k=4\), and p-values that are on average better for \(k=3\) than for \(k=2\). This suggests that 3 components is the best compromise. In the same vein, we observe no difference in cluster description between the clustering obtained on k components and that obtained on the initial data.
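This comparison can be scripted. A sketch, assuming `raw_data` as loaded above, that cross-tabulates the partitions obtained with 2, 3 and 4 retained components:

```r
# One HCPC partition per choice of ncp; cross-tables show how many
# individuals change cluster between configurations.
parts <- lapply(c(2, 3, 4), function(k) {
  pca <- PCA(raw_data, quali.sup = c(1, 5, 6, 7, 9), quanti.sup = 10,
             ncp = k, graph = FALSE)
  HCPC(pca, nb.clust = -1, graph = FALSE)$data.clust$clust
})
table(k2 = parts[[1]], k3 = parts[[2]])   # agreement between ncp = 2 and ncp = 3
table(k3 = parts[[2]], k4 = parts[[3]])   # agreement between ncp = 3 and ncp = 4
```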
res.pca <- PCA(raw_data,quali.sup = c(1,5,6,7,9),quanti.sup = 10,ncp=4)
res.hcpc <- HCPC(res.pca, nb.clust = -1)
res.pca <- PCA(raw_data,quali.sup = c(1,5,6,7,9),quanti.sup = 10,ncp=3)
res.hcpc <- HCPC(res.pca, nb.clust = -1)
res.pca <- PCA(raw_data,quali.sup = c(1,5,6,7,9),quanti.sup = 10,ncp=2)
res.hcpc <- HCPC(res.pca, nb.clust = -1)
res.hcpc <- HCPC(res.pca, nb.clust = -1, graph = FALSE)
Chi-squared approximation may be incorrect
res.hcpc
**Results for the Hierarchical Clustering on Principal Components**
name description
1 "$data.clust" "dataset with the cluster of the individuals"
2 "$desc.var" "description of the clusters by the variables"
3 "$desc.var$quanti.var" "description of the cluster var. by the continuous var."
4 "$desc.var$quanti" "description of the clusters by the continuous var."
5 "$desc.var$test.chi2" "description of the cluster var. by the categorical var."
6 "$desc.axes$category" "description of the clusters by the categories."
7 "$desc.axes" "description of the clusters by the dimensions"
8 "$desc.axes$quanti.var" "description of the cluster var. by the axes"
9 "$desc.axes$quanti" "description of the clusters by the axes"
10 "$desc.ind" "description of the clusters by the individuals"
11 "$desc.ind$para" "parangons of each clusters"
12 "$desc.ind$dist" "specific individuals"
13 "$call" "summary statistics"
14 "$call$t" "description of the tree"
plot.HCPC(res.hcpc, choice = 'map', draw.tree = FALSE, title = '', select=c("12"))
res.pca <- PCA(raw_data,quali.sup = c(1,5,6,7,9),quanti.sup = 10,ncp=3)
res.hcpc <- HCPC(res.pca, nb.clust = -1, graph = FALSE)
Chi-squared approximation may be incorrect
Characterization of each supplier
14) The methodology that you have used to describe clusters can also be used to describe a categorical variable, for instance the supplier. Use the function catdes(data, num.var=1) and explain how this information can be useful for the company.
catdes(raw_data, num.var=1)
Chi-squared approximation may be incorrect
$test.chi2
p.value df
Raw.Material 9.049049e-05 4
Impermeability 1.088731e-02 2
$category
$category$`Supplier A`
Cla/Mod Mod/Cla Global p.value v.test
Raw.Material=PS 42.30769 37.93103 13.61257 0.0002998155 3.615459
Impermeability=Type 2 34.78261 27.58621 12.04188 0.0130149176 2.483361
Shape=Shape 2 26.66667 41.37931 23.56021 0.0213728107 2.301333
Raw.Material=ABS 0.00000 0.00000 10.99476 0.0254288561 -2.234825
Impermeability=Type 1 12.50000 72.41379 87.95812 0.0130149176 -2.483361
$category$`Supplier B`
Cla/Mod Mod/Cla Global p.value v.test
Raw.Material=ABS 100.00000 14.18919 10.99476 0.003330616 2.935453
Raw.Material=PS 57.69231 10.13514 13.61257 0.015928453 -2.410551
Shape=Shape 2 60.00000 18.24324 23.56021 0.002374481 -3.038894
$category$`Supplier C`
Cla/Mod Mod/Cla Global p.value v.test
Raw.Material=PP 9.722222 100 75.39267 0.01626019 2.403023
$quanti.var
Eta2 P-value
nb.of.pieces 0.2137072 1.530822e-10
$quanti
$quanti$`Supplier A`
NULL
$quanti$`Supplier B`
v.test Mean in category Overall mean sd in category Overall sd p.value
nb.of.pieces -2.817845 3.959459 4.115183 1.240523 1.413225 0.004834708
$quanti$`Supplier C`
v.test Mean in category Overall mean sd in category Overall sd p.value
nb.of.pieces 6.345875 6.428571 4.115183 1.720228 1.413225 2.211654e-10
attr(,"class")
[1] "catdes" "list "
catdes(raw_data, num.var=5)
Chi-squared approximation may be incorrect
$test.chi2
p.value df
Impermeability 2.873602e-16 3
Raw.Material 8.762044e-07 6
Finishing 1.040072e-02 3
$category
$category$`Shape 1`
Cla/Mod Mod/Cla Global p.value v.test
Impermeability=Type 1 76.785714 99.2307692 87.95812 1.043033e-11 6.800436
Raw.Material=PP 72.222222 80.0000000 75.39267 3.596432e-02 2.097331
Finishing=Lacquering 72.868217 72.3076923 67.53927 4.420603e-02 2.012132
Finishing=Hot Printing 58.064516 27.6923077 32.46073 4.420603e-02 -2.012132
Raw.Material=PS 30.769231 6.1538462 13.61257 3.360691e-05 -4.147538
Impermeability=Type 2 4.347826 0.7692308 12.04188 1.043033e-11 -6.800436
$category$`Shape 2`
Cla/Mod Mod/Cla Global p.value v.test
Impermeability=Type 2 95.65217 48.88889 12.04188 2.151940e-15 7.932260
Raw.Material=PS 69.23077 40.00000 13.61257 1.004609e-07 5.325888
Supplier=Supplier A 41.37931 26.66667 15.18325 2.137281e-02 2.301333
Supplier=Supplier B 18.24324 60.00000 77.48691 2.374481e-03 -3.038894
Raw.Material=PP 16.66667 53.33333 75.39267 2.022645e-04 -3.716171
Impermeability=Type 1 13.69048 51.11111 87.95812 2.151940e-15 -7.932260
$category$`Shape 3`
NULL
$category$`Shape 4`
Cla/Mod Mod/Cla Global p.value v.test
Finishing=Hot Printing 9.677419 75 32.46073 0.0169336 2.388146
Finishing=Lacquering 1.550388 25 67.53927 0.0169336 -2.388146
$quanti.var
Eta2 P-value
Price 0.24285191 2.771217e-11
Diameter 0.23221716 9.994081e-11
Length 0.23112294 1.139178e-10
weight 0.19722569 5.965369e-09
nb.of.pieces 0.10533516 1.120672e-04
Mature.Volume 0.05693699 1.177220e-02
$quanti
$quanti$`Shape 1`
v.test Mean in category Overall mean sd in category Overall sd p.value
nb.of.pieces -3.721118 3.853846 4.115183 1.2716527 1.4132247 1.983430e-04
weight -5.069026 1.418671 1.714121 0.7867181 1.1728539 3.998565e-07
Length -5.469752 8.191771 10.329589 5.0759423 7.8647827 4.506649e-08
Diameter -5.489199 1.026728 1.294639 0.6306647 0.9821218 4.037612e-08
Price -6.344431 14.290942 16.552332 4.8726895 7.1724314 2.232495e-10
$quanti$`Shape 2`
v.test Mean in category Overall mean sd in category Overall sd p.value
Diameter 6.616403 2.143782 1.294639 1.409033 9.821218e-01 3.680436e-11
Length 6.603559 17.116281 10.329589 11.260640 7.864783e+00 4.014035e-11
Price 6.176070 22.340911 16.552332 9.620363 7.172431e+00 6.571699e-10
weight 6.118100 2.651800 1.714121 1.698673 1.172854e+00 9.469770e-10
nb.of.pieces 2.865929 4.644444 4.115183 1.607698 1.413225e+00 4.157871e-03
Mature.Volume -2.005008 58355.222222 82206.026178 68318.473831 9.103190e+04 4.496223e-02
$quanti$`Shape 3`
NULL
$quanti$`Shape 4`
v.test Mean in category Overall mean sd in category Overall sd p.value
nb.of.pieces 3.078997 5.62500 4.115183 9.921567e-01 1.413225 0.002076987
Mature.Volume 2.629122 165250.00000 82206.026178 1.132649e+05 91031.901051 0.008560561
Price 2.044271 21.63988 16.552332 3.822103e+00 7.172431 0.040926790
attr(,"class")
[1] "catdes" "list "
catdes(raw_data, num.var=6)
Chi-squared approximation may be incorrect
$test.chi2
p.value df
Raw.Material 4.088669e-21 2
Shape 2.873602e-16 3
Supplier 1.088731e-02 2
$category
$category$`Type 1`
Cla/Mod Mod/Cla Global p.value v.test
Shape=Shape 1 99.23077 76.785714 68.06283 1.043033e-11 6.800436
Raw.Material=PP 97.91667 83.928571 75.39267 1.773212e-11 6.723573
Supplier=Supplier A 72.41379 12.500000 15.18325 1.301492e-02 -2.483361
Raw.Material=PS 30.76923 4.761905 13.61257 5.429478e-15 -7.816541
Shape=Shape 2 51.11111 13.690476 23.56021 2.151940e-15 -7.932260
$category$`Type 2`
Cla/Mod Mod/Cla Global p.value v.test
Shape=Shape 2 48.8888889 95.652174 23.56021 2.151940e-15 7.932260
Raw.Material=PS 69.2307692 78.260870 13.61257 5.429478e-15 7.816541
Supplier=Supplier A 27.5862069 34.782609 15.18325 1.301492e-02 2.483361
Raw.Material=PP 2.0833333 13.043478 75.39267 1.773212e-11 -6.723573
Shape=Shape 1 0.7692308 4.347826 68.06283 1.043033e-11 -6.800436
$quanti.var
Eta2 P-value
Diameter 0.47062626 6.604215e-28
Length 0.46804072 1.049429e-27
weight 0.45675032 7.728264e-27
Price 0.43301606 4.512224e-25
Mature.Volume 0.07171395 1.801495e-04
$quanti
$quanti$`Type 1`
v.test Mean in category Overall mean sd in category Overall sd p.value
Mature.Volume 3.691294 91225.988095 82206.026178 9.338486e+04 9.103190e+04 2.231162e-04
Price -9.070449 14.805996 16.552332 4.819967e+00 7.172431e+00 1.185272e-19
weight -9.315716 1.420835 1.714121 6.707159e-01 1.172854e+00 1.211330e-20
Length -9.430150 8.338742 10.329589 4.357114e+00 7.864783e+00 4.095012e-21
Diameter -9.456161 1.045344 1.294639 5.411724e-01 9.821218e-01 3.194554e-21
$quanti$`Type 2`
v.test Mean in category Overall mean sd in category Overall sd p.value
Diameter 9.456161 3.115573 1.294639 1.449522 9.821218e-01 3.194554e-21
Length 9.430150 24.871426 10.329589 11.600832 7.864783e+00 4.095012e-21
weight 9.315716 3.856391 1.714121 1.708742 1.172854e+00 1.211330e-20
Price 9.070449 29.308174 16.552332 8.516118 7.172431e+00 1.185272e-19
Mature.Volume -3.691294 16321.086957 82206.026178 13496.587327 9.103190e+04 2.231162e-04
attr(,"class")
[1] "catdes" "list "
15) To simultaneously take into account quantitative and categorical variables in the clustering, you should use the HCPC function on the results of FAMD. FAMD stands for Factorial Analysis of Mixed Data and is a PCA dedicated to mixed data. Explain what the impacts of such an analysis on the results will be.
res.famd <- FAMD(raw_data, ncp = 5, graph = TRUE, sup.var = c(10), axes = c(1,2), row.w = NULL, tab.comp = NULL)
res.hcpc.famd <- HCPC(res.famd, nb.clust = -1, graph = TRUE)
plot.HCPC(res.hcpc.famd, choice = 'map', draw.tree = FALSE, select = c(1,10), title = '')
raw_data_clust <- res.hcpc.famd$data.clust
res.famd_2 <- FAMD (raw_data_clust, ncp = 6, graph = TRUE, sup.var = c(1,5,7,10), axes = c(1,2), row.w = NULL, tab.comp = NULL)
res.famd_2$eig
coordinates <- as.data.frame(res.famd_2$ind$coord)
coordinates$Price <- raw_data_clust$Price
coordinates
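One direct way to see the impact of including the categorical variables is to cross-tabulate the FAMD-based partition against the PCA-based one. A sketch, assuming `res.hcpc` from the earlier PCA-based clustering is still in scope:

```r
# Rows: PCA-based clusters; columns: FAMD-based clusters. Off-diagonal
# counts are individuals reassigned once categorical variables are used.
table(PCA = res.hcpc$data.clust$clust, FAMD = res.hcpc.famd$data.clust$clust)
```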
17) If someone asks you why you built one global model and not one model per supplier, what is your answer?
Below we count the number of data points we have for each supplier; we have to weigh the advantages and pitfalls of building local versus global models.
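The counts per supplier can be obtained with:

```r
# Lot counts per supplier (the summary above shows 31 / 150 / 14):
# with only 14 lots, a dedicated model for Supplier C would be very fragile.
table(raw_data$Supplier)
```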
18) These data contained missing values. A representative of the company suggests either putting 0 in the missing cells or imputing with the median of the variables. Comment. For the categorical variables with missing values, it is decided to create a new category “missing”. Comment.
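Imputing a constant (0 or the median) shrinks the variance of the imputed variable and distorts the correlations that PCA/FAMD rely on. A sketch of a more principled alternative, assuming the missMDA package (designed to work with FactoMineR) and a version of `raw_data` that actually contains NAs:

```r
library(missMDA)

quant <- raw_data[, sapply(raw_data, is.numeric)]    # quantitative columns only
nb    <- estim_ncpPCA(quant)                         # choose ncp by cross-validation
comp  <- imputePCA(quant, ncp = nb$ncp)$completeObs  # PCA-model-based imputation
```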
res.hcpc.famd$call$X$Dim.1
[1] -1.98063919 -1.83264787 -1.83106329 -1.78617733 -1.77810618 -1.76793474 -1.76710263 -1.75740608 -1.74971627 -1.74854492 -1.72234966 -1.66587800 -1.64757197 -1.64122934
... (output truncated: 191 coordinates in total, sorted in increasing order) ...
[183] 5.86802720 5.87862818 6.58628131 6.60222800 6.60582995 7.20144721 7.28664755 7.39353843 7.91577973